Skip to content

fix: oracle chokepoint guard + effective_model helper#174

Merged
xdotli merged 4 commits intobenchflow-ai:mainfrom
EYH0602:fix/oracle-chokepoint-and-cleanup
Apr 22, 2026
Merged

fix: oracle chokepoint guard + effective_model helper#174
xdotli merged 4 commits intobenchflow-ai:mainfrom
EYH0602:fix/oracle-chokepoint-and-cleanup

Conversation

@EYH0602
Copy link
Copy Markdown
Contributor

@EYH0602 EYH0602 commented Apr 21, 2026

Summary

Follow-up to #173. The reviewer flagged that the fix landed in src/benchflow/cli/eval.py, but bench eval create actually dispatches to src/benchflow/cli/main.py:707cli/eval.py is not wired into the live CLI. On top of that, #173 deliberately removed the chokepoint guard from _agent_env.py:147 on the assumption that the CLI would pre-filter, so oracle is now more broken on the live code path: bench eval create -a oracle raises ValueError("ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001' but not set …") even though oracle never calls an LLM.

Three layers, one PR:

  • Layer 1 — restore the chokepoint (_agent_env.py:147). Put back if model and agent != "oracle":. Single point of defense; catches every caller (current + future, CLI + SDK + YAML).
  • Layer 2 — clarify the not-yet-live module. Keep src/benchflow/cli/eval.py and tests/test_eval_cli.py (future-facing eval design), but add docstring notes that they are not wired into the live CLI. Update eval.py to use the effective_model() helper for the oracle guard. A regression test ensures cli/main.py does not import cli/eval.
  • Layer 3 — kill the structural pressure. New helper effective_model(agent, model) in job.py. Change JobConfig.model default to None. Replace all 7 model or DEFAULT_MODEL sites in cli/main.py and job.py YAML loaders with the helper. Oracle gets honest model=None end-to-end (in result.json, summary.json, prints).

Regression tests

tests/test_resolve_env_helpers.py::TestResolveAgentEnvOracle pins the chokepoint at unit level. tests/test_oracle_chokepoint.py (13 tests) pins each layer at the right altitude:

Class What it pins
TestEvalModuleNotWiredIntoCLI cli/main.py does not import cli/eval
TestEvalCreateRouting bench eval createcli/main.py:eval_create
TestEffectiveModel helper behaviour for oracle / non-oracle / empty model
TestOracleYamlLoaders Job.from_yaml(oracle config) → model is None (native + Harbor)
TestEvalCreateOracleCLI end-to-end: live eval_create(agent="oracle") + no API key → no raise

Test plan

  • pytest tests/test_oracle_chokepoint.py → 13 passed
  • pytest tests/test_eval_cli.py → 7 passed
  • pytest tests/test_resolve_env_helpers.py → all passed
  • Full suite: same pre-existing failures as main (Docker not running locally). Zero new failures.
  • ty check src/: same pre-existing diagnostics as main. Zero new ones.

Open in Devin Review

EYH0602 added 2 commits April 21, 2026 15:32
PR benchflow-ai#173 moved the oracle/DEFAULT_MODEL guard from resolve_agent_env to
cli/eval.py, but cli/eval.py is orphaned (never imported into the live
CLI), so `bench eval create` still passes DEFAULT_MODEL to oracle and
trips ANTHROPIC_API_KEY validation. Three changes:

- Restore the `agent != "oracle"` guard in resolve_agent_env so the
  chokepoint defends against any caller that forwards a model.
- Delete the orphan cli/eval.py and its tests — the live eval_create
  lives in cli/main.py and was the actual code path users hit.
- Add effective_model(agent, model) helper, change JobConfig.model
  default to None, replace seven `model or DEFAULT_MODEL` sites in
  cli/main.py and job.py YAML loaders so oracle gets honest model=None
  end-to-end (in result/summary JSON, prints, and downstream Trial).

Regression test in test_resolve_env_helpers.py pins the chokepoint by
calling resolve_agent_env("oracle", DEFAULT_MODEL, {}) with no API key
and no host auth — verified to fail on main with the user-facing
ANTHROPIC_API_KEY error and pass after the fix.
Bundle 14 tests in tests/test_oracle_chokepoint.py that pin each layer
of the prior fix at the right altitude:

- TestOrphanRemoval — cli/eval.py is gone (ModuleNotFoundError) and no
  src/ file references benchflow.cli.eval, guarding against a future
  re-introduction that could swallow the next bug fix the same way.
- TestEvalCreateRouting — `bench eval create` callback lives in
  cli/main.py:eval_create. Pins the architectural fact PR benchflow-ai#173 missed.
- TestEffectiveModel — unit tests for the helper: oracle drops model,
  non-oracle falls back to DEFAULT_MODEL, empty string treated as unset.
- TestOracleYamlLoaders — Job.from_yaml(oracle config) → model is None
  for both native and Harbor formats; non-oracle backwards-compat
  preserved.
- TestEvalCreateOracleCLI — end-to-end: live eval_create(agent="oracle")
  with no API key in env does not raise. Mocks Trial.create and resets
  the asyncio loop after to avoid polluting pre-existing tests that use
  the deprecated asyncio.get_event_loop() pattern.

Verified to fail on main in the right shape: 9 of 14 fail (each pinning
a deleted/added behavior), 5 pass (asserting structural facts already
true). The CLI test fails on main with the user-reported error
"ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001'…".
@EYH0602 EYH0602 marked this pull request as draft April 21, 2026 22:48
EYH0602 added 2 commits April 21, 2026 15:53
The previous commit deleted cli/eval.py and its tests as orphans, but
they are intentionally kept. Restore both from main, update eval.py to
use the effective_model() helper for the oracle chokepoint fix, and
replace the "module is gone" regression test with a guard that cli/main.py
does not import cli/eval (the actual invariant).
@EYH0602 EYH0602 marked this pull request as ready for review April 21, 2026 22:55
@EYH0602 EYH0602 changed the title fix: oracle chokepoint guard + drop orphan eval CLI + effective_model helper fix: oracle chokepoint guard + effective_model helper Apr 21, 2026
Copy link
Copy Markdown
Contributor Author

@EYH0602 EYH0602 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Reviewed 7 files, 346 lines added / 28 deleted across bug detection, type design, test coverage, and guidelines compliance.

Critical Issues (confidence 90-100)

None.

Important Issues (confidence 75-89)

None.

All findings scored below the 75 confidence threshold — no actionable issues found.

Positive Observations

  • Three-layer defense is excellent architecture. Layer 1 (chokepoint guard in resolve_agent_env), Layer 2 (docstring/structural clarification on the not-yet-live cli/eval.py), and Layer 3 (effective_model() helper) each defend independently — any single layer failing still leaves the others protecting.

  • effective_model() eliminates 9 duplicate model or DEFAULT_MODEL sites across cli/main.py, cli/eval.py, and the YAML loaders in job.py. Centralizing the rule "oracle → None, else → model or default" in one helper is a meaningful duplication reduction.

  • Test layering matches the code layering. Tests pin each defense at the right altitude: TestResolveAgentEnvOracle pins Layer 1, TestEvalModuleNotWiredIntoCLI pins Layer 2 structurally, TestEffectiveModel pins the helper, TestOracleYamlLoaders pins the YAML entry points, and TestEvalCreateOracleCLI pins the full CLI → chokepoint path end-to-end. 13 new tests for a 3-line core change is proportionate to the bug's subtlety.

  • Structural regression test is clever. test_cli_main_does_not_import_cli_eval reads the source text of cli/main.py and asserts no cli/eval import — this directly prevents the mistake that made PR #173's fix land in dead code.

  • JobConfig.model default → None is clean. Forcing callers through effective_model() means config objects carry honest data (oracle shows model=null in summary.json, not a bogus default). All internal callers were updated.

  • Edge cases covered. Empty string model from Harbor YAML, explicit model forwarded to oracle, non-oracle fallback to DEFAULT_MODEL — all verified in tests.

Verified: all 9 model or DEFAULT_MODEL sites in cli/main.py + cli/eval.py are replaced, both YAML loaders in job.py route through the helper, and DEFAULT_MODEL is only imported where still needed. No remaining unguarded call sites. LGTM — request an independent reviewer per CLAUDE.md policy.

@EYH0602
Copy link
Copy Markdown
Contributor Author

EYH0602 commented Apr 21, 2026

@xdotli this fix should do the work

@xdotli xdotli self-requested a review April 22, 2026 00:49
@xdotli
Copy link
Copy Markdown
Member

xdotli commented Apr 22, 2026

@xdotli this fix should do the work

waiting for @devin to review it

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

Open in Devin Review

Comment thread src/benchflow/cli/eval.py

The future-facing entry point for running evaluations. Anthropic-style shape:
resource creation, one command, return the result or a job-id.
NOTE: This module is **not wired into the live CLI**. The active
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@claude is this comment good?

@xdotli xdotli merged commit 144b6dc into benchflow-ai:main Apr 22, 2026
1 of 2 checks passed
@EYH0602 EYH0602 deleted the fix/oracle-chokepoint-and-cleanup branch April 22, 2026 07:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants